A Write-friendly Store for SILT: Part I

Learn how to design a key-value store for fast writing.

Preamble#

In the last lesson, we decided to use multiple stores in our design, each optimized for a particular purpose. These stores include a write-friendly store, intermediary stores, and a memory-efficient store. Collectively, these stores will help our design achieve good performance and low memory overhead per key. Let's dive into our first store: the write-friendly store.

Our write-friendly store will handle all PUT and DELETE requests. A PUT request inserts a new entry or updates an existing one, while a DELETE request removes a key-value entry. For GET requests, we'll later use a higher-level wrapper to find a key across all the stores. Doing so is necessary because, over time, keys trickle from the write-friendly store to the memory-efficient store.

Appending to storage log for fast write#

Memory provides fast random access. However, storing all the key-value data in memory would require a lot of RAM, which is fairly expensive and might not be enough to hold all the data. Writing to storage is slow, and storage generally performs poorly with random writes. However, sequential writes to storage are much faster than random writes.

Before moving on, let's review the difference between random and sequential writing performance for the most commonly used storage hardware.

Random vs. sequential writing#

We will not dive into the details of random vs. sequential access. Instead, we will compare the random and sequential performance of some modern hardware to emphasize the importance of the considerations made in our design.

Note: We prefer writing and reading both traditional rotating disks and newer flash-based disks sequentially, but for different reasons. Sequential writing is faster on rotating disks because it avoids unnecessary head-seek latencies, while on flash-based disks it utilizes the storage cells more evenly, increasing their overall lifespan. Additionally, sequential operations let us benefit from on-disk caches that prefetch the data next in line.

The following table lists all the hardware we selected for this analysis. Except for the first disk, all others are different kinds of SSDs. In the table, all conversions between MB/s and IOPS assume 4 KB blocks.

| Name (Capacity) | Application | Sequential Write Performance | Random Write Performance |
| --- | --- | --- | --- |
| Western Digital Blue (1 TB) | PCs | Up to 130 MB/s ~= 800 IOPS | Up to 53 MB/s ~= 450 IOPS |
| Samsung 980 (1 TB) | PCs | Up to 3,000 MB/s ~= 0.73M IOPS | Up to 2,000 MB/s ~= 0.48M IOPS |
| Samsung 990 PRO (1 TB) | PCs and Game Consoles | Up to 6,900 MB/s ~= 1.68M IOPS | Up to 6,300 MB/s ~= 1.55M IOPS |
| Intel® Optane™ SSD DC P4800X Series (750 GB) | Data Centers | Up to 2,200 MB/s ~= 0.54M IOPS | Up to 2,300 MB/s ~= 0.55M IOPS |
| Intel® Optane™ SSD DC P5800X Series (800 GB) | Data Centers | Up to 6,100 MB/s ~= 1.49M IOPS | Up to 5,500 MB/s ~= 1.35M IOPS |

On the official websites of the above-mentioned storage products, their manufacturers have listed sequential write performance in MB/s and random write performance in IOPS. Why is that?

This is because sequential writing consists almost entirely of write operations: the address to write to is already known, and the only work left is to transfer the data onto the drive. Random writing is different. In the worst case, every write operation requires at least one read operation (for example, reading a block before modifying and rewriting it), so random write performance is listed in IOPS rather than MB/s. In the best case, a sequence of random writes is essentially sequential because their addresses happen to fall one after the other.

Based on the information above, we can read the IOPS figure in the sequential write column as the number of 4 KB block writes per second, and the MB/s figure in the random write column as the amount of data per second that 4 KB random writes can move.
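The conversion between the two ratings is simple arithmetic. The sketch below (helper names are our own, and we assume the vendor convention of 1 MB = 1,000 KB, which is why the results differ slightly from the table's rounded figures) shows how the table's numbers relate:

```python
# Convert between a drive's MB/s rating and its 4 KB-block IOPS rating.
BLOCK_SIZE_KB = 4

def mbps_to_iops(mbps, block_kb=BLOCK_SIZE_KB):
    """How many block-sized operations per second an MB/s rating implies."""
    return (mbps * 1000) / block_kb  # 1 MB = 1,000 KB in vendor ratings

def iops_to_mbps(iops, block_kb=BLOCK_SIZE_KB):
    """How much data per second an IOPS rating moves in block-sized chunks."""
    return (iops * block_kb) / 1000

# Samsung 980, sequential write: "up to 3,000 MB/s"
print(f"{mbps_to_iops(3000):,.0f} IOPS")     # 750,000 IOPS, i.e., ~0.73M in the table
# Samsung 980, random write: ~0.48M IOPS
print(f"{iops_to_mbps(480_000):,.0f} MB/s")  # 1,920 MB/s, i.e., ~2,000 in the table
```

The small gap between these results and the table's values comes from rounding and from whether a vendor counts 1 MB as 1,000 KB or 1,024 KB.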

The first thing to observe is that the rotating disk's performance is at least two orders of magnitude lower than that of the SSD-based disks. That is understandable, given that rotating disks contain mechanical moving parts while SSDs are entirely electronic. (Old-style hard disks remain popular because they provide far more storage capacity per dollar than SSDs.)

Second, except for the first two rows, there isn't much difference between sequential and random write performance. On rotating disks, which are still very common, there is at least an order of magnitude difference between sequential and random writes. For SSDs, the concern is different: they often need to erase a full block before rewriting new data, and doing so excessively reduces the disk's lifespan because each cell can only be erased a limited number of times.

The message here is that sequential writing is good: for a rotating disk, we get good performance with sequential writing, and for SSDs, sequential writing helps to extend their lifespan.

Writing sequentially#

We'll write sequentially to storage. Our write-friendly store will append all PUT and DELETE requests to an in-storage log, making the writing process fast.
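An append-only log of this kind can be sketched in a few lines. The record layout below (a 1-byte operation code, a 2-byte key length, and a 4-byte value length) is an illustrative assumption, not SILT's actual on-disk format:

```python
import struct

OP_PUT, OP_DELETE = 0, 1

class LogStore:
    """A minimal sketch of an append-only log for PUT/DELETE requests."""

    def __init__(self, path):
        # Open in append mode: every write lands at the end of the file.
        self.f = open(path, "ab")

    def _append(self, op, key, value=b""):
        offset = self.f.tell()  # byte offset where this record starts
        # Header: 1-byte op code, 2-byte key length, 4-byte value length.
        header = struct.pack(">BHI", op, len(key), len(value))
        self.f.write(header + key + value)
        return offset

    def put(self, key, value):
        return self._append(OP_PUT, key, value)

    def delete(self, key):
        # A DELETE is just another appended record (a "tombstone");
        # nothing is overwritten in place.
        return self._append(OP_DELETE, key)
```

Because both PUT and DELETE only ever append, every request turns into a sequential write at the current end of the log, which is exactly the access pattern the hardware comparison above favors.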

[Figure: Sequential writing to the in-storage log. Incoming PUT and DELETE requests are appended to the end of the log in their order of arrival, so entries are stored in the order they were inserted and the log grows sequentially.]

In-memory hash table#

The process above presents a problem. Accessing a key would require O(n) time, where n is the number of key-value entries stored in the key-value store, because we'd have to scan the unsorted log for its entry. The following reasons require faster access to specific keys.

  • We may need to search for keys for GET requests.

  • We will need to access keys later to move our entries to memory-efficient stores.

A major consideration is that this store contains the latest (most up-to-date) values of its keys. For GET requests, we should look for the key in this store first to avoid returning an outdated value from the subsequent stores.

We cannot afford a linear search in this store for every GET request; we need to access our keys in constant time. Furthermore, since this store holds the smallest proportion of all stored key-value entries (most keys will already have moved to a more memory-efficient store), the requested key will usually not be here. So, in addition to returning stored values quickly, we also need to reject requests for absent keys in constant time.

To solve this problem, we will maintain an in-memory hash table that maps each key to its offset in the log; a hash-table entry is added right after a successful append to the in-storage log. Now we can access any key in the store in O(1) time.

This hash table also serves as an in-memory filter. The hash function gives us the hash value for the key we are looking for, and if that hash value does not exist in the hash table, the key is not present in this store. This eliminates storage seeks for keys that do not exist in the write-friendly store, resulting in lower read amplification. We will explore this further in the GET request lesson.
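The log-plus-index combination can be sketched as follows. Here a Python dict plays the role of the in-memory hash table, and a list stands in for the in-storage log; the class and method names are illustrative, not SILT's:

```python
class WriteFriendlyStore:
    """Sketch: append-only log plus an in-memory key-to-offset index."""

    def __init__(self):
        self.log = []    # stands in for the in-storage log
        self.index = {}  # maps key -> offset of its latest entry

    def put(self, key, value):
        offset = len(self.log)
        self.log.append((key, value))  # append to the "log" first
        self.index[key] = offset       # then record the offset in the index

    def get(self, key):
        # O(1) filter: a key absent from the index is rejected
        # without touching storage at all.
        offset = self.index.get(key)
        if offset is None:
            return None  # not in this store; the wrapper checks the next one
        return self.log[offset][1]
```

Note that updating a key simply appends a new log entry and repoints the index at it, so a GET always returns the latest value even though older entries still sit in the log.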

[Figure: Log in storage with an in-memory hash table to store offsets. After each successful append to the log, a hash-table entry mapping the key's hash (e.g., h(K2), h(K3), h(K4)) to the entry's log offset is added in memory; incoming requests are discarded after processing.]
